compact model
Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
Kotoge, Rikuto, Nishimura, Mai, Ma, Jiaxin
Reinforcement Learning (RL) has emerged as a dominant post-training approach for eliciting agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g., 0.5--1B parameters) presents unique challenges: these models exhibit poor initial performance, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To understand how compact models preserve agentic behavior, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, in some cases even outperforming the larger teacher model. DGPO makes agentic RAG feasible in resource-constrained computing environments.
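The two DGPO ingredients described in the abstract, cold-start fine-tuning on teacher demonstrations followed by policy optimization under continuous teacher guidance, can be sketched in a few lines. Below is a minimal, illustrative PyTorch version of the combined training objective; the tensor shapes, names, and the weighting `beta` are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dgpo_step(student_logits, teacher_logits, actions, advantages, beta=0.1):
    """Policy-gradient step with continuous teacher guidance (illustrative).

    student_logits, teacher_logits: (batch, seq, vocab) from the compact
    policy and a frozen teacher; actions: sampled token ids (batch, seq);
    advantages: per-sequence reward advantages (batch,).
    """
    logp = F.log_softmax(student_logits, dim=-1)
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, T)
    pg_loss = -(advantages.unsqueeze(-1) * act_logp).mean()  # REINFORCE-style term
    # Teacher guidance: a KL term that pulls the student toward the frozen
    # teacher's distribution, stabilizing training when task rewards are sparse.
    kl = F.kl_div(logp, F.log_softmax(teacher_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return pg_loss + beta * kl
```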
- Leisure & Entertainment (1.00)
- Education (0.69)
- Media > Music (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
Embedding-Enhanced Probabilistic Modeling of Ferroelectric Field Effect Transistors (FeFETs)
Afee, Tasnia Nobi, Hutchins, Jack, Islam, Md Mazharul, Kampfe, Thomas, Aziz, Ahmedullah
FeFETs hold strong potential for advancing memory and logic technologies, but their inherent randomness arising from both operational cycling and fabrication variability poses significant challenges for accurate and reliable modeling. Capturing this variability is critical, as it enables designers to predict behavior, optimize performance, and ensure reliability and robustness against variations in manufacturing and operating conditions. Existing deterministic and machine learning-based compact models often fail to capture the full extent of this variability or lack the mathematical smoothness required for stable circuit-level integration. In this work, we present an enhanced probabilistic modeling framework for FeFETs that addresses these limitations. Building upon a Mixture Density Network (MDN) foundation, our approach integrates C∞ continuous activation functions for smooth, stable learning and a device-specific embedding layer to capture intrinsic physical variability across devices. Sampling from the learned embedding distribution enables the generation of synthetic device instances for variability-aware simulation. With an R² of 0.92, the model demonstrates high accuracy in capturing the variability of FeFET current behavior. Altogether, this framework provides a scalable, data-driven solution for modeling the full stochastic behavior of FeFETs and offers a strong foundation for future compact model development and circuit simulation integration.
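A minimal PyTorch sketch of the architecture described above: an MDN head conditioned on bias inputs plus a learned per-device embedding, built from smooth (C∞) activations such as tanh. All sizes and names are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class FeFETMDN(nn.Module):
    """Mixture density network over drain current, conditioned on bias
    voltages plus a learned per-device embedding (illustrative sketch)."""
    def __init__(self, n_devices, n_inputs=2, emb_dim=8, hidden=64, n_mix=5):
        super().__init__()
        self.emb = nn.Embedding(n_devices, emb_dim)  # device-to-device variability
        self.net = nn.Sequential(                    # tanh is C-infinity smooth
            nn.Linear(n_inputs + emb_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.pi = nn.Linear(hidden, n_mix)         # mixture weights (logits)
        self.mu = nn.Linear(hidden, n_mix)         # component means
        self.log_sigma = nn.Linear(hidden, n_mix)  # component scales (log)

    def forward(self, v_bias, device_id):
        h = self.net(torch.cat([v_bias, self.emb(device_id)], dim=-1))
        return self.pi(h), self.mu(h), self.log_sigma(h)

def mdn_nll(pi_logits, mu, log_sigma, y):
    """Negative log-likelihood of currents y under the Gaussian mixture."""
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = comp.log_prob(y.unsqueeze(-1)) + torch.log_softmax(pi_logits, dim=-1)
    return -torch.logsumexp(log_prob, dim=-1).mean()
```

Sampling new embedding vectors, e.g. from a Gaussian fitted to the learned embedding table, would then yield synthetic device instances for variability-aware simulation, as the abstract describes.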
A Compact Model for Large-Scale Time Series Forecasting
Yeh, Chin-Chia Michael, Fan, Xiran, Jiang, Zhimeng, Fan, Yujie, Chen, Huiyuan, Saini, Uday Singh, Lai, Vivian, Dai, Xin, Wang, Junpeng, Zhuang, Zhongfang, Wang, Liang, Zheng, Yan
Spatio-temporal data, which commonly arise in real-world applications such as traffic monitoring, financial transactions, and ride-share demands, represent a special category of multivariate time series. They exhibit two distinct characteristics: high dimensionality and commensurability across spatial locations. These attributes call for computationally efficient modeling approaches and facilitate the use of univariate forecasting models in a channel-independent fashion. SparseTSF, a recently introduced competitive univariate forecasting model, harnesses periodicity to achieve compactness by concentrating on cross-period dynamics, thereby extending the Pareto frontier with respect to model size and predictive performance. Nonetheless, it underperforms on spatio-temporal data due to an inadequate capture of intra-period temporal dependencies. To address this shortcoming, we propose UltraSTF, which integrates a cross-period forecasting module with an ultra-compact shape bank component. Our model effectively detects recurring patterns in time series through the attention mechanism of the shape bank component, thereby strengthening its ability to learn intra-period dynamics. UltraSTF achieves state-of-the-art performance on the LargeST benchmark while employing fewer than 0.2% of the parameters required by the second-best approaches, thus further extending the Pareto frontier of existing methods.
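The cross-period idea is easy to state in code: reshape the lookback window into (period, number of periods), forecast each phase of the period with one shared linear map, and add an intra-period pattern mixed from a small learned shape bank via attention. The sketch below is an illustrative reading of that design, not the UltraSTF implementation; `period`, `n_shapes`, and the additive combination are assumptions.

```python
import torch
import torch.nn as nn

class CrossPeriodForecaster(nn.Module):
    """Cross-period linear forecasting in the spirit of SparseTSF, plus a
    learned "shape bank" attended to for intra-period structure (sketch)."""
    def __init__(self, lookback, horizon, period, n_shapes=16):
        super().__init__()
        assert lookback % period == 0 and horizon % period == 0
        # One linear map shared across all phases of the period: it sees the
        # sequence of same-phase samples, i.e. the cross-period dynamics.
        self.cross = nn.Linear(lookback // period, horizon // period)
        # Shape bank: learned intra-period templates, mixed via attention.
        self.shapes = nn.Parameter(torch.randn(n_shapes, period))
        self.query = nn.Linear(lookback, n_shapes)
        self.period, self.horizon = period, horizon

    def forward(self, x):                                 # x: (batch, lookback)
        b = x.size(0)
        xp = x.view(b, -1, self.period).transpose(1, 2)   # (b, period, n_periods)
        yp = self.cross(xp)                               # (b, period, h_periods)
        y = yp.transpose(1, 2).reshape(b, self.horizon)   # cross-period forecast
        attn = torch.softmax(self.query(x), dim=-1)       # (b, n_shapes)
        shape = attn @ self.shapes                        # (b, period) pattern
        return y + shape.repeat(1, self.horizon // self.period)
```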
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.87)
- Information Technology > Data Science > Data Mining (0.84)
Dissertation: Machine Learning in Materials Science -- A Case Study in Carbon Nanotube Field-Effect Transistors
Carbon nanotubes have long been seen as a promising candidate for high-performance electronics, yet their unique 1D structure leads to challenges in device fabrication. Many processing approaches have been proposed to produce better-performing CNTFETs, and this explosion of data calls for an efficient way to explore it.
- Semiconductors & Electronics (1.00)
- Materials (0.75)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
- Energy > Oil & Gas > Upstream (0.46)
Optimizing Performance: How Compact Models Match or Exceed GPT's Classification Capabilities through Fine-Tuning
Lefort, Baptiste, Benhamou, Eric, Ohana, Jean-Jacques, Saltiel, David, Guez, Beatrice
In this paper, we demonstrate that non-generative, small-sized models such as FinBERT and FinDRoBERTa, when fine-tuned, can outperform GPT-3.5 and GPT-4 in zero-shot settings on sentiment analysis of financial news. These fine-tuned models show results comparable to GPT-3.5 when it is fine-tuned on the task of determining market sentiment from daily financial news summaries sourced from Bloomberg. To fine-tune and compare these models, we created a novel database that assigns a market score to each piece of news without human interpretation bias, systematically identifying the mentioned companies and analyzing whether their stocks went up, down, or remained neutral. Furthermore, the paper shows that the assumptions of Condorcet's Jury Theorem do not hold, suggesting that the fine-tuned small models are not independent of the fine-tuned GPT models, indicating behavioural similarities. Lastly, the resulting fine-tuned models are made publicly available on HuggingFace, providing a resource for further research in financial sentiment analysis and text classification.
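The bias-free labeling rule described above can be sketched directly: gather the realized returns of the companies a news item mentions and label the item by their average move, with a neutral band. The threshold below is an illustrative assumption, not the paper's value.

```python
def market_label(news_returns, band=0.005):
    """Assign a market-based sentiment label to one news item from the
    realized returns of the companies it mentions (illustrative rule;
    the 0.5% neutral band is an assumption).

    news_returns: list of next-day returns, one per mentioned company.
    """
    if not news_returns:
        return "neutral"
    avg = sum(news_returns) / len(news_returns)
    if avg > band:
        return "up"
    if avg < -band:
        return "down"
    return "neutral"

# e.g. a story mentioning two tickers that moved +1.2% and +0.4%:
print(market_label([0.012, 0.004]))  # -> "up"
```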
- Asia (0.68)
- North America > United States (0.46)
- Banking & Finance > Trading (1.00)
- Energy > Oil & Gas > Trading (0.46)
Compact Model Parameter Extraction via Derivative-Free Optimization
Martinez, Rafael Perez, Iwamoto, Masaya, Woo, Kelly, Bian, Zhengliang, Tinti, Roberto, Boyd, Stephen, Chowdhury, Srabanti
In this paper, we address the problem of compact model parameter extraction, simultaneously extracting tens of parameters via derivative-free optimization. Traditionally, parameter extraction is performed manually by dividing the complete set of parameters into smaller subsets, each targeting different operational regions of the device, a process that can take days or even weeks. Our approach streamlines this process by employing derivative-free optimization to identify a parameter set that fits the compact model well without performing an exhaustive number of simulations. We further enhance the optimization process to address critical issues in device modeling by carefully choosing a loss function that evaluates model performance consistently across varying magnitudes by focusing on relative rather than absolute errors, prioritizes accuracy in key operational regions of the device above a certain threshold, and reduces sensitivity to outliers. Furthermore, we use a train-test split to assess model fit and avoid overfitting: we fit 80% of the data and test the model's efficacy on the remaining 20%. We demonstrate the effectiveness of our methodology by successfully modeling two semiconductor devices: a diamond Schottky diode and a GaN-on-SiC HEMT, the latter involving the ASM-HEMT DC model, which requires simultaneously extracting 35 model parameters to fit the model to the measured data. These examples showcase the practical benefits of derivative-free optimization in device modeling.
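A minimal sketch of this recipe with SciPy: a relative-error, outlier-robust loss over an 80/20 train-test split, minimized with a derivative-free method (Nelder-Mead here). The `model` callable stands in for the compact-model simulator, and the current floor, log-cosh penalty, and iteration budget are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from scipy.optimize import minimize

def relative_error_loss(params, v, i_meas, model, i_floor=1e-9):
    """Relative-error loss over measured I-V arrays: consistent weighting
    across decades of current, with points below `i_floor` masked out
    (threshold and `model` are illustrative stand-ins)."""
    i_sim = model(v, params)
    mask = np.abs(i_meas) > i_floor          # focus on key operating regions
    rel = (i_sim[mask] - i_meas[mask]) / i_meas[mask]
    return np.mean(np.log(np.cosh(rel)))     # soft, outlier-robust penalty

def extract(v, i_meas, model, x0, seed=0):
    """80/20 train-test split, then a derivative-free search (Nelder-Mead)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(v))
    tr, te = idx[: int(0.8 * len(v))], idx[int(0.8 * len(v)):]
    res = minimize(relative_error_loss, x0,
                   args=(v[tr], i_meas[tr], model),
                   method="Nelder-Mead", options={"maxiter": 5000})
    test_loss = relative_error_loss(res.x, v[te], i_meas[te], model)
    return res.x, test_loss
```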
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Sonoma County > Santa Rosa (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- (5 more...)
Fast System Technology Co-Optimization Framework for Emerging Technology Based on Graph Neural Networks
Ma, Tianliang, Fan, Guangxi, Sun, Xuguang, Deng, Zhihui, Low, Kainlu, Shao, Leilai
This paper proposes a fast system technology co-optimization (STCO) framework that optimizes power, performance, and area (PPA) for next-generation IC design, addressing the challenges and opportunities presented by novel materials and device architectures. We focus on accelerating the technology level of STCO with AI techniques, employing graph neural network (GNN)-based approaches for both TCAD simulation and cell library characterization; these are interconnected through a unified compact model and collectively achieve over a 100X speedup over traditional methods. These advancements enable comprehensive STCO iterations with runtime speedups ranging from 1.9X to 14.1X and support both emerging and traditional technologies.
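As a rough illustration of the GNN-surrogate idea, the sketch below implements a generic message-passing regressor that maps a device or cell graph (node features plus adjacency) to a scalar target such as a characterization value. It is a stand-in written in plain PyTorch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GNNSurrogate(nn.Module):
    """Generic message-passing surrogate: node features and an adjacency
    matrix in, one scalar prediction out (a minimal illustrative sketch)."""
    def __init__(self, in_dim, hidden=64, n_layers=3):
        super().__init__()
        self.inp = nn.Linear(in_dim, hidden)
        self.msgs = nn.ModuleList(nn.Linear(hidden, hidden)
                                  for _ in range(n_layers))
        self.out = nn.Linear(hidden, 1)

    def forward(self, x, adj):               # x: (n, in_dim), adj: (n, n)
        h = torch.relu(self.inp(x))
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        for lin in self.msgs:
            h = torch.relu(lin((adj @ h) / deg) + h)  # mean aggregation + residual
        return self.out(h.mean(0))           # graph-level readout
```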
- Asia > China > Shanghai > Shanghai (0.05)
- North America > United States > California > San Diego County > San Diego (0.04)
Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
Kang, Junmo, Xu, Wei, Ritter, Alan
Fine-tuning large models is highly effective; however, inference can be expensive and produces carbon emissions. Knowledge distillation has been shown to be a practical solution for reducing inference costs, but the distillation process itself requires significant computational resources. Rather than buying or renting GPUs to fine-tune, then distill, a large model, an NLP practitioner might instead choose to allocate the available budget to hiring annotators and manually labeling additional fine-tuning data. In this paper, we investigate how to most efficiently use a fixed budget to build a compact model. Through extensive experiments on six diverse tasks, we show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data to directly train a compact model (T5-Small). We further investigate how the optimal budget allocated towards computation varies across scenarios. We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
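The distillation side of this trade-off can be sketched as sequence-level distillation: the teacher pseudo-labels unlabeled inputs, and the compact student is then fine-tuned on those pairs instead of human annotations. The snippet below uses `t5-small` as a stand-in for both models (the paper distills T5-XXL into T5-Small); it is illustrative, not the authors' pipeline.

```python
# Sequence-level distillation: the teacher pseudo-labels unlabeled inputs,
# then the compact student is fine-tuned on those pseudo-labels.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

teacher_name = "t5-small"  # stand-in; the paper's teacher is T5-XXL (11B)
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

def pseudo_label(texts, max_new_tokens=32):
    """Generate teacher outputs to use as distillation targets."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = teacher.generate(**batch, max_new_tokens=max_new_tokens)
    return tok.batch_decode(out, skip_special_tokens=True)

# The resulting (text, pseudo_label) pairs then replace human annotations
# in a standard seq2seq fine-tuning loop for the student.
```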
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (5 more...)
Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
Zhou, Wangchunshu, Bras, Ronan Le, Choi, Yejin
Pre-trained Transformer models like T5 and BART have advanced the state of the art on a wide range of text generation tasks. Compressing these models into smaller ones has become critically important for practical use. Common neural network compression techniques such as knowledge distillation or quantization are limited to static compression, where the compression ratio is fixed. In this paper, we introduce Modular Transformers, a modularized encoder-decoder framework for flexible sequence-to-sequence model compression. Modular Transformers train modularized layers that have the same function as two or more consecutive layers in the original model via module replacing and knowledge distillation. After training, the modularized layers can be flexibly assembled into sequence-to-sequence models that meet different performance-efficiency trade-offs. Experimental results show that after a single training phase, by simply varying the assembling strategy, Modular Transformers can achieve flexible compression ratios from 1.1x to 6x with little to moderate relative performance drop.
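The module-replacing objective can be sketched as output matching: a compact module is trained so that its output on a hidden state agrees with that of the span of frozen original layers it will replace. A minimal, illustrative PyTorch version follows; the names and the pure-MSE objective are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn as nn

def module_replacing_loss(original_layers, compact_module, h, alpha=1.0):
    """Train `compact_module` to stand in for a span of consecutive original
    layers by matching their output on hidden states `h` (illustrative)."""
    with torch.no_grad():
        target = h
        for layer in original_layers:   # e.g. two frozen encoder layers
            target = layer(target)
    return alpha * nn.functional.mse_loss(compact_module(h), target)

# After training, spans of the original model can be swapped for their
# compact modules in any combination, giving a range of compression ratios.
```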
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Texas (0.04)
- North America > Dominican Republic (0.04)
- (16 more...)
On the Effectiveness of Compact Biomedical Transformers
Rohanian, Omid, Nouriborji, Mohammadmahdi, Kouchaki, Samaneh, Clifton, David A.
Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. Many existing pre-trained models, on the other hand, are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension, and number of layers. The natural language processing (NLP) community has developed numerous strategies to compress these models using techniques such as pruning, quantisation, and knowledge distillation, resulting in models that are considerably faster, smaller, and consequently easier to use in practice. By the same token, in this paper we introduce six lightweight models, namely BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT, which are obtained either by knowledge distillation from a biomedical teacher or by continual learning on the PubMed dataset via the Masked Language Modelling (MLM) objective. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1, with the goal of creating efficient lightweight models that perform on par with their larger counterparts. All the models will be publicly available on our Huggingface profile at https://huggingface.co/nlpie
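Of the two routes mentioned, the continual-learning one is straightforward to sketch with Hugging Face: resume masked-language-model training of a compact encoder on biomedical text. The snippet below uses `distilbert-base-uncased` and two toy sentences as stand-ins for the actual models and the PubMed corpus.

```python
# Continual MLM pretraining of a compact encoder on biomedical text (sketch).
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # compact start
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

# Stand-in for a tokenized PubMed abstract corpus (two toy sentences here):
texts = ["Aspirin irreversibly inhibits cyclooxygenase.",
         "BRCA1 mutations increase breast cancer risk."]
enc = tok(texts, truncation=True)
dataset = [{"input_ids": i, "attention_mask": a}
           for i, a in zip(enc["input_ids"], enc["attention_mask"])]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="compact-bio-mlm",
                           per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```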
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- (4 more...)
- Overview (0.93)
- Research Report (0.82)